Medical Charges Analysis

Fatimah Alawad

10/3/2021

What factors affect the medical charges?

The dataset is from Kaggle and contains the following variables:

Variable    Description
age         Age of the primary beneficiary
sex         Gender of the insurance contractor (female, male)
bmi         Body mass index (kg/m^2): weight divided by height squared, an objective index of whether body weight is high or low relative to height; ideally 18.5 to 24.9
children    Number of children covered by health insurance (number of dependents)
smoker      Smoking status (yes, no)
region      The beneficiary’s residential area in the US (northeast, southeast, southwest, northwest)
charges     Individual medical costs billed by health insurance
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.5     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x dplyr::select() masks MASS::select()

Exploring the variables

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   27.00   39.00   39.21   51.00   64.00

From the graph, we can see that the charges increase as age goes up.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   1.095   2.000   5.000

This plot shows that families with 4 or 5 children have lower charges than families with fewer children, which is unexpected.
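One possible reason for this pattern is that few families have 4 or 5 children, so those group averages rest on little data. The sketch below is one way to check this. It assumes a data frame with numeric columns children and charges; the function name charges_by_children is made up for illustration.

```r
# Summarise charges by number of children, including how many rows fall in each group.
# `data` is assumed to have numeric columns `children` and `charges`.
charges_by_children <- function(data) {
  groups <- split(data$charges, data$children)
  data.frame(
    children       = as.integer(names(groups)),
    n              = vapply(groups, length, integer(1)),
    mean_charges   = vapply(groups, mean, numeric(1)),
    median_charges = vapply(groups, median, numeric(1)),
    row.names      = NULL
  )
}

# Usage (assuming the data frame is called `insurance`):
# charges_by_children(insurance)
```

If the n column is small for 4 and 5 children, the lower averages there should be read with caution.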

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.96   26.30   30.40   30.66   34.69   53.13

From this plot, we can see that as bmi increases, charges increase.

## 
## female   male 
##    662    676

As we might expect, there is not much difference between the insurance charges of males and females.

## 
##   no  yes 
## 1064  274

We can see that smokers have higher charges than non-smokers, which suggests that smoking may have a negative impact on smokers’ health.
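One way to put a number on the gap seen in the plot is to compare group means and run a two-sample test. This is a minimal sketch and not necessarily what was done in this report. It assumes a data frame with a numeric charges column and a smoker column with levels "no" and "yes"; compare_smokers is a hypothetical helper name.

```r
# Compare mean charges between smokers and non-smokers.
# `data` is assumed to have a numeric `charges` column and a `smoker` column ("no"/"yes").
compare_smokers <- function(data) {
  group_means <- tapply(data$charges, data$smoker, mean)
  test <- t.test(charges ~ smoker, data = data)  # Welch two-sample t-test (unequal variances)
  list(
    group_means = group_means,
    difference  = unname(group_means["yes"] - group_means["no"]),
    p_value     = test$p.value
  )
}

# Usage (assuming the data frame is called `insurance`):
# compare_smokers(insurance)
```

A large difference with a small p-value supports the visual impression, although it describes an association in the data and not a causal effect.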

Building a model that predicts the medical charges

## 
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region + 
##     sex, data = insurance)
## 
## Coefficients:
##     (Intercept)              age              bmi         children  
##        -11938.5            256.9            339.2            475.5  
##       smokeryes  regionnorthwest  regionsoutheast  regionsouthwest  
##         23848.5           -353.0          -1035.0           -960.1  
##         sexmale  
##          -131.3
## 
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region + 
##     sex, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11304.9  -2848.1   -982.1   1393.9  29992.8 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11938.5      987.8 -12.086  < 2e-16 ***
## age                256.9       11.9  21.587  < 2e-16 ***
## bmi                339.2       28.6  11.860  < 2e-16 ***
## children           475.5      137.8   3.451 0.000577 ***
## smokeryes        23848.5      413.1  57.723  < 2e-16 ***
## regionnorthwest   -353.0      476.3  -0.741 0.458769    
## regionsoutheast  -1035.0      478.7  -2.162 0.030782 *  
## regionsouthwest   -960.1      477.9  -2.009 0.044765 *  
## sexmale           -131.3      332.9  -0.394 0.693348    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6062 on 1329 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7494 
## F-statistic: 500.8 on 8 and 1329 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region, 
##     data = insurance)
## 
## Coefficients:
##     (Intercept)              age              bmi         children  
##        -11990.3            257.0            338.7            474.6  
##       smokeryes  regionnorthwest  regionsoutheast  regionsouthwest  
##         23836.3           -352.2          -1034.4           -959.4
## 
## Call:
## lm(formula = charges ~ age + bmi + children + smoker + region, 
##     data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11367.2  -2835.4   -979.7   1361.9  29935.5 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11990.27     978.76 -12.250  < 2e-16 ***
## age                256.97      11.89  21.610  < 2e-16 ***
## bmi                338.66      28.56  11.858  < 2e-16 ***
## children           474.57     137.74   3.445 0.000588 ***
## smokeryes        23836.30     411.86  57.875  < 2e-16 ***
## regionnorthwest   -352.18     476.12  -0.740 0.459618    
## regionsoutheast  -1034.36     478.54  -2.162 0.030834 *  
## regionsouthwest   -959.37     477.78  -2.008 0.044846 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6060 on 1330 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7496 
## F-statistic: 572.7 on 7 and 1330 DF,  p-value: < 2.2e-16
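In the first summary the sex coefficient is not significant (p = 0.693), and dropping it raises the adjusted R-squared slightly from 0.7494 to 0.7496. The two fits can be reproduced and compared along the following lines. This is a sketch that assumes the data frame has the columns named in the Call output; compare_with_without_sex is a hypothetical helper name.

```r
# Fit the full model and the model without `sex`, then compare them.
# `data` is assumed to contain charges, age, bmi, children, smoker, region and sex.
compare_with_without_sex <- function(data) {
  full    <- lm(charges ~ age + bmi + children + smoker + region + sex, data = data)
  reduced <- update(full, . ~ . - sex)  # same model with the sex term removed
  list(
    full    = full,
    reduced = reduced,
    adj_r2  = c(full    = summary(full)$adj.r.squared,
                reduced = summary(reduced)$adj.r.squared),
    f_test  = anova(reduced, full)      # partial F-test for the dropped term
  )
}

# Usage (assuming the data frame is called `insurance`, as in the Call output):
# compare_with_without_sex(insurance)$f_test
```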

Checking the model

The Residual versus Fitted plot shows that there is a concern that the relationship is non-linear.

The Normal Q-Q plot shows that the residuals are not normally distributed.

The Scale-Location plot shows that the assumption of equal variance is satisfied, since the points are randomly distributed except for those in the lower left.

From the Residuals vs Leverage plot, we can see that observation 1048 could be an influential observation.
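The four plots discussed above are the standard diagnostics that plot() draws for an lm object. A minimal sketch of how they can be drawn, together with a list of the observations with the largest Cook's distance, is given here; check_model is a hypothetical helper name and the choice of five rows is arbitrary.

```r
# Draw the four standard lm diagnostic plots and return the most influential rows.
check_model <- function(fit, top = 5) {
  old_par <- par(mfrow = c(2, 2))   # 2 x 2 grid so all four plots fit on one page
  on.exit(par(old_par))             # restore the previous layout afterwards
  plot(fit)                         # Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
  cooks <- cooks.distance(fit)
  head(sort(cooks, decreasing = TRUE), top)  # largest Cook's distances, named by row
}

# Usage (assuming `model` is the fitted lm object):
# check_model(model)
```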

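Since there is a concern that the relationship is non-linear, and the plot of charges against age does not look linear, I replace age with age squared in the model. Note that inside an R formula, age^2 is interpreted as plain age (a squared term has to be written as I(age^2)), which is why the output below is identical to that of the previous model.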

## 
## Call:
## lm(formula = charges ~ age^2 + bmi + children + smoker + region, 
##     data = insurance)
## 
## Coefficients:
##     (Intercept)              age              bmi         children  
##        -11990.3            257.0            338.7            474.6  
##       smokeryes  regionnorthwest  regionsoutheast  regionsouthwest  
##         23836.3           -352.2          -1034.4           -959.4
## 
## Call:
## lm(formula = charges ~ age^2 + bmi + children + smoker + region, 
##     data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11367.2  -2835.4   -979.7   1361.9  29935.5 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -11990.27     978.76 -12.250  < 2e-16 ***
## age                256.97      11.89  21.610  < 2e-16 ***
## bmi                338.66      28.56  11.858  < 2e-16 ***
## children           474.57     137.74   3.445 0.000588 ***
## smokeryes        23836.30     411.86  57.875  < 2e-16 ***
## regionnorthwest   -352.18     476.12  -0.740 0.459618    
## regionsoutheast  -1034.36     478.54  -2.162 0.030834 *  
## regionsouthwest   -959.37     477.78  -2.008 0.044846 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6060 on 1330 degrees of freedom
## Multiple R-squared:  0.7509, Adjusted R-squared:  0.7496 
## F-statistic: 572.7 on 7 and 1330 DF,  p-value: < 2.2e-16
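The coefficients above are identical to those of the model with plain age, and the term is still labelled age. This is the expected behaviour of ^ inside an R formula, where it is a formula operator and not arithmetic. The sketch below illustrates the difference on simulated data only (x and y are made-up variables, not the insurance data).

```r
# Simulated data only, to show how `^` behaves inside a formula.
set.seed(42)
x <- runif(100, 18, 64)
y <- 5 + 0.02 * x^2 + rnorm(100)

fit_linear  <- lm(y ~ x)            # plain linear term
fit_caret   <- lm(y ~ x^2)          # formula operator: same model as y ~ x
fit_squared <- lm(y ~ I(x^2))       # arithmetic square, protected by I()
fit_both    <- lm(y ~ x + I(x^2))   # linear and quadratic terms together

all.equal(coef(fit_linear), coef(fit_caret))  # TRUE: the two fits are the same
names(coef(fit_caret))                        # "(Intercept)" "x"
names(coef(fit_squared))                      # "(Intercept)" "I(x^2)"
```

For the insurance model, a squared age term would therefore be written as I(age^2) in the formula, optionally alongside age itself.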

Now the Residuals vs Fitted plot shows that the relationship is linear.

The Normal Q-Q plot shows that the residuals are not normally distributed.

The Scale-Location plot shows that there are two clusters, but the assumption of equal variance is satisfied since the points are randomly distributed around the line.

From the Residuals vs Leverage plot, we can see that many observations could be influential.

To improve the model, we will check if there is an interaction between the explanatory variables.

## 
## Call:
## lm(formula = charges ~ bmi * smoker, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19768.0  -4400.7   -869.5   2957.7  31055.9 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     5879.42     976.87   6.019 2.27e-09 ***
## bmi               83.35      31.27   2.666  0.00778 ** 
## smokeryes     -19066.00    2092.03  -9.114  < 2e-16 ***
## bmi:smokeryes   1389.76      66.78  20.810  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6161 on 1334 degrees of freedom
## Multiple R-squared:  0.7418, Adjusted R-squared:  0.7412 
## F-statistic:  1277 on 3 and 1334 DF,  p-value: < 2.2e-16

We can see that the interaction effect (bmi:smokeryes) is statistically significant, with p < 2e-16. The plot also shows an interaction between bmi and smoker status.
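The interaction means that the fitted line for charges against bmi has a different intercept and slope for smokers and non-smokers. The arithmetic can be worked through from the coefficients reported in the summary above; the variable names are chosen here for illustration.

```r
# Coefficients copied from the summary of lm(charges ~ bmi * smoker) above.
b0       <- 5879.42     # (Intercept)
b_bmi    <- 83.35       # bmi
b_smoker <- -19066.00   # smokeryes
b_int    <- 1389.76     # bmi:smokeryes

slope_nonsmoker  <- b_bmi             # 83.35 per BMI unit
slope_smoker     <- b_bmi + b_int     # 1473.11 per BMI unit
intercept_smoker <- b0 + b_smoker     # -13186.58

# BMI at which the two fitted lines cross (smoker effect is zero)
crossover_bmi <- -b_smoker / b_int    # about 13.7
```

The negative smokeryes coefficient does not mean smokers are cheaper. The two lines cross at a BMI of about 13.7, which is below the smallest BMI in the data (15.96), so the fitted charges for smokers are higher across the whole observed range, and the gap widens as BMI increases.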

Adding the interaction between bmi and smoker status to the model

## 
## Call:
## lm(formula = charges ~ age^2 + bmi + children + smoker + region + 
##     bmi * smoker, data = insurance)
## 
## Coefficients:
##     (Intercept)              age              bmi         children  
##        -2453.56           264.04            22.61           512.71  
##       smokeryes  regionnorthwest  regionsoutheast  regionsouthwest  
##       -20309.09          -581.70         -1207.01         -1227.60  
##   bmi:smokeryes  
##         1438.11
## 
## Call:
## lm(formula = charges ~ age^2 + bmi + children + smoker + region + 
##     bmi * smoker, data = insurance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14655.4  -1918.9  -1313.4   -489.7  30333.1 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -2453.564    857.695  -2.861  0.00429 ** 
## age                264.042      9.522  27.729  < 2e-16 ***
## bmi                 22.615     25.620   0.883  0.37756    
## children           512.713    110.266   4.650 3.65e-06 ***
## smokeryes       -20309.092   1648.861 -12.317  < 2e-16 ***
## regionnorthwest   -581.704    381.215  -1.526  0.12727    
## regionsoutheast  -1207.011    383.109  -3.151  0.00167 ** 
## regionsouthwest  -1227.601    382.576  -3.209  0.00136 ** 
## bmi:smokeryes     1438.108     52.630  27.325  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4851 on 1329 degrees of freedom
## Multiple R-squared:  0.8405, Adjusted R-squared:  0.8395 
## F-statistic: 875.4 on 8 and 1329 DF,  p-value: < 2.2e-16

We can see that the adjusted R-squared is higher: 0.8395, compared with 0.7496 for the model without the interaction.
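To see what the final model implies, its coefficients can be turned into a small prediction function. This is an illustrative sketch: predict_charges and the example person are hypothetical, the coefficients are copied from the summary above, and northeast and non-smoker are taken as the reference levels because they have no coefficients of their own.

```r
# Coefficients copied from the final model summary above
# (reference levels: non-smoker, northeast).
coefs <- c(intercept = -2453.564, age = 264.042, bmi = 22.615, children = 512.713,
           smokeryes = -20309.092, northwest = -581.704, southeast = -1207.011,
           southwest = -1227.601, bmi_smokeryes = 1438.108)

predict_charges <- function(age, bmi, children, smoker, region = "northeast") {
  s <- as.numeric(smoker)  # TRUE -> 1, FALSE -> 0
  region_effect <- if (region == "northeast") 0 else coefs[[region]]
  coefs[["intercept"]] + coefs[["age"]] * age + coefs[["bmi"]] * bmi +
    coefs[["children"]] * children + coefs[["smokeryes"]] * s +
    region_effect + coefs[["bmi_smokeryes"]] * bmi * s
}

# Hypothetical 40-year-old in the northeast with BMI 30 and no children
predict_charges(40, 30, 0, smoker = FALSE)  # about 8,787
predict_charges(40, 30, 0, smoker = TRUE)   # about 31,621
```

With a fitted lm object available, predict() with a newdata data frame would give the same numbers without copying coefficients by hand.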

Determine the prediction errors

Validation set approach

## [1] 1100    7
## [1] 29231798

The test error for the regression model for charges on age^2, BMI, children, smoker status, BMI*smoker, and region based on the validation set approach is 29231798.
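The validation set approach fits the model on a random subset of rows and measures the mean squared error on the rows left out. A minimal sketch follows. The default of 1100 training rows matches the dimensions printed above; the function name validation_mse and the seed are assumptions. Because the split is random, the result changes with the seed.

```r
# Validation-set estimate of test MSE for an lm formula.
# `n_train` = 1100 matches the training-set size printed above; the seed is arbitrary.
validation_mse <- function(data, formula, n_train = 1100, seed = 1) {
  set.seed(seed)
  train_rows <- sample(nrow(data), n_train)      # random training rows
  fit  <- lm(formula, data = data[train_rows, ])
  test <- data[-train_rows, ]                    # held-out rows
  response <- all.vars(formula)[1]               # name of the response variable
  mean((test[[response]] - predict(fit, newdata = test))^2)
}

# Usage (assuming the data frame is called `insurance`):
# validation_mse(insurance, charges ~ age + children + region + bmi * smoker)
```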

K-fold cross-validation approach

## [1] 23639330

The test error for the regression model for charges on age^2, BMI, children, smoker status, BMI*smoker, and region based on the 5-fold cross-validation approach is 23639330.

We can see that the 5-fold cross-validation approach produces a lower error than the validation set approach.
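K-fold cross-validation splits the rows into k groups, holds each group out in turn, and averages the held-out errors. The names printed in the next section ("call", "K", "delta", "seed") are consistent with cv.glm() from the boot package having been used here, although the code is not shown. The sketch below is a base R equivalent written for illustration; kfold_mse and the seed are assumptions.

```r
# k-fold cross-validation estimate of test MSE for an lm formula (base R only).
kfold_mse <- function(data, formula, k = 5, seed = 1) {
  set.seed(seed)
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))  # random fold label per row
  response <- all.vars(formula)[1]
  fold_mse <- vapply(seq_len(k), function(i) {
    fit <- lm(formula, data = data[fold != i, ])            # fit on the other k - 1 folds
    held_out <- data[fold == i, ]
    mean((held_out[[response]] - predict(fit, newdata = held_out))^2)
  }, numeric(1))
  # Weight each fold by its size so unequal folds are handled correctly
  weighted.mean(fold_mse, w = tabulate(fold, nbins = k))
}

# Usage (assuming the data frame is called `insurance`):
# kfold_mse(insurance, charges ~ age + children + region + bmi * smoker, k = 5)
```

The errors reported here are mean squared errors in squared charge units. Taking the square root of 23639330 gives a typical prediction error of roughly 4,860 in the units of charges.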

LOOCV approach

## [1] "call"  "K"     "delta" "seed"
## [1] 23712836

The LOOCV approach produces a lower error than the validation set approach and a higher error than the 5-fold cross-validation approach.
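The LOOCV and 5-fold estimates differ by only about 0.3% (23712836 versus 23639330). For a least-squares model, LOOCV does not require refitting the model once per row: each leave-one-out residual equals the ordinary residual divided by one minus the leverage of that row. The sketch below uses this identity; loocv_mse is a hypothetical helper name, and this is not necessarily how the value above was computed.

```r
# LOOCV mean squared error for a fitted lm, without refitting n times.
# Uses the identity  e_(i) = e_i / (1 - h_i)  for least-squares fits,
# where e_i is the ordinary residual and h_i the leverage (hat value) of row i.
loocv_mse <- function(fit) {
  mean((residuals(fit) / (1 - hatvalues(fit)))^2)
}

# Usage (assuming `model` is the fitted lm object):
# loocv_mse(model)
```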